logo

Part 1 – Setup & Data

## Function to install and load packages
install_and_load <- function(package_names) {
  # Check which packages are not installed
  new_packages <- package_names[!(package_names %in% installed.packages()[, "Package"])]
  
  # Install new packages
  if(length(new_packages)) {
    install.packages(new_packages)
  }
  
  # Load all packages
  sapply(package_names, require, character.only = TRUE)
}

# Load in packages
packages <- c("tidyverse", "tidymodels", "ggplot2", "RcppRoll", "vip", "doParallel",
              "xgboost","lme4", "finetune", "PlayerRatings", "gt","gtExtras")

install_and_load(packages)
##     tidyverse    tidymodels       ggplot2      RcppRoll           vip 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##    doParallel       xgboost          lme4      finetune PlayerRatings 
##          TRUE          TRUE          TRUE          TRUE          TRUE 
##            gt      gtExtras 
##          TRUE          TRUE
# Import relevant csv data
player_data <- read_csv("data/player_game_data.csv", show_col_types = FALSE)
team_data <- read_csv("data/team_game_data.csv", show_col_types = FALSE)

Part 2 – Playoffs Series Modeling

Task is to fit a model that predicts the winner and the number of games in a playoffs series between any given two teams with the following specifications:

  1. The final output must include the probability of each team winning the series. The model must also predict the number of games in the series and can be probabilistic or a point estimate.

  2. Only data available prior to the start of the series can be used e.g. team’s stats from the 2016-17 season can’t be used to predict a playoffs series from the 2015-16 season.

Model Overview

Introduction

To predict the outcome of a series between two teams in the NBA Playoffs (2023-2024 season), you can approach this problem either at a game or series level which have different strengths and weaknesses in their approach.

I have chosen to build a model on the game-level using a powerful model called XGBoost to predict each game of a 7-game series in order to predict the outcome of a series between any two playoff teams in a given playoff round from the 2024 playoffs.

This model uses certain inputs including box score statistics that take a player and/or teams averages over the past 5 games throughout the regular season (known as rolling averages), as well as indicators of fatigue and momentum such as days since last played and days until next game. I have also incorporated team ratings that assesses the strength (both defensive and offensive) of a team accounting for variations and fluctuations between- and within-seasons.

By using these inputs, a user can predict the outcome of a NBA playoff series from the 2023-24 season and can also look at the most likely path that a team would take in the playoffs including advancements and eliminations.

Strengths & Weaknesses

By taking a game-level approach, the model can only predict a playoff series as the sum of the individual games within that series. Predicting at the game level has a distinct advantage over predicting at the series level as it allows us to account for the impact of chance and other related factors to occur across the 7 games as players and team can and do under/over perform. However, by taking this approach the model is not capturing season-level specific-metrics such as league standings, home and away season records directly that might help predict the outcome of a playoff series in one season.

Additionally, this model assumes that each game in a given series are independent (not the same and are unrelated). This would normally be disadvantageous as intuition tells us that in a 7 game series momentum matters and that teams who go up 3-0 have never lost a series. However, by assuming series games are independent, this allows the model to be abstracted beyond the 2023-24 NBA playoffs- i.e. the model can be used to predict games or series given updated game data from both the past and in the future.

The last caveat of using this model is that the skill rating model used to calculate team strength was not directly optimised and therefore may bias dominant teams where the natural decay in rating when teams lose may not occur as quickly for teams who have won in previous seasons and then had large reductions in team performance the next.

Addressing Weaknesses

To address these weaknesses the following could be implemented given more time and/or data: - Tuning the parameters for the team strength model (as it is the strongest predictor of game outcome in our model). - Adding in a specific seasonal component prior so that rolling averages are computer within-season rather than between games across seasons and league/home/away records. - Greater exploration into metrics relating to player and team health given its impact on the 2023-24 NBA playoffs thus far
accounting for players who not only don’t play but are injured. - Exploring different combinations or ensemble models for game outcomes. - If given more data, building a series-level predictor model using a greater sample of historical playoff series outcomes.

Functions & Helpers

# Helper functions ---------------------------

# Function to count games in the last n days
count_games_last_n_days <- function(dates, n_days) {
  sapply(1:length(dates), function(i) {
    if (i == 1) {
      NA
    } else {
      sum(dates[i] - dates[1:(i-1)] <= n_days)
    }
  })
}

# Function to create playoff bracket including 1st Round pre-fill and blank entries for rest of playoffs
create_round_bracket <- function(initial_matchups, round_name, playoff_team_seed) {
  round_num <- round_name
  # Helper function to expand series into games based on home advantage pattern
  expand_series <- function(h_team, a_team, conf_name, round_num) {
    # Define the home team pattern based on the game number
    home_pattern <- c(h_team, h_team, a_team, a_team, h_team, a_team, h_team)
    away_pattern <- c(a_team, a_team, h_team, h_team, a_team, h_team, a_team)
    tibble(
      conference = ifelse(round_num == 4, "Both", conf_name),
      round_number = round_num,
      round_name = c("Round 1", "Round 2", "Conference Finals", "Finals")[round_num],
      game_number = 1:7,
      h_team = home_pattern,
      a_team = away_pattern
    )
  }
  
  # Mapping round names to numbers
  round_map <- c("Round 1" = 1, "Round 2" = 2, "Conference Finals" = 3, "Finals" = 4)
  round_num <- round_map[[round_name]]
  
  # Generate the data frame for the specified round from the initial match ups
  round_bracket <- bind_rows(
    lapply(names(initial_matchups), function(conf_name) {
      bind_rows(
        lapply(names(initial_matchups[[conf_name]]), function(h_team) {
          a_team <- initial_matchups[[conf_name]][[h_team]]
          expand_series(h_team, a_team, conf_name, round_num)
        })
      )
    })
  )
  
  if (round_name == 4) {
    round_bracket <- round_bracket %>%
      mutate(
        season = 2023,
        nbagameid = row_number(),
        gamedate = as.Date('2024-12-31')
      ) %>%
      left_join(
        playoff_team_seed %>% select(team_name,league_seed),
        by = c("h_team" = "team_name")
      ) %>%
      rename("h_seed" = "league_seed") %>%
      left_join(
        playoff_team_seed %>% select(team_name,league_seed),
        by = c("a_team" = "team_name")
      ) %>%
      rename("a_seed" = "league_seed")
    
  } else {
    round_bracket <- round_bracket %>%
      mutate(
        season = 2023,
        nbagameid = row_number(),
        gamedate = as.Date('2024-12-31')
      ) %>%
      left_join(
        playoff_team_seed,
        by = c("h_team" = "team_name")
      ) %>%
      rename("h_seed" = "seed") %>%
      left_join(
        playoff_team_seed,
        by = c("a_team" = "team_name")
      ) %>%
      rename("a_seed" = "seed")
  }
  
  return(round_bracket)
}

# Function to get features for current bracket
bracket_with_features <- function(bracket, most_recent_ratings,most_recent_rolling_features,most_recent_player_features) {
  bracket_features <- bracket %>%
    inner_join(
      most_recent_ratings,
      by = c("h_team" = "team")
    ) %>%
    rename("h_rating" = rating) %>%
    inner_join(
      most_recent_ratings,
      by = c("a_team" = "team")
    ) %>%
    rename("a_rating" = rating) %>%
    inner_join(
      most_recent_rolling_features,
      by = c("h_team" = "team")
    ) %>%
    rename_with(~ paste0("h_", .), fg2made:ft_rate) %>%
    inner_join(
      most_recent_rolling_features,
      by = c("a_team" = "team")
    ) %>%
    rename_with(~ paste0("a_", .), fg2made:ft_rate) %>%
    inner_join(
      most_recent_player_features,
      by = c("h_team" = "team")
    ) %>%
    rename_with(~ paste0("h_", .), mean_oreb_pct:cumulative_unique_lineups) %>%
    inner_join(
      most_recent_player_features,
      by = c("a_team" = "team")
    ) %>%
    rename_with(~ paste0("a_", .), mean_oreb_pct:cumulative_unique_lineups) %>%
    mutate(
      diff_rating = h_rating - a_rating,
      diff_fg2made = h_fg2made - a_fg2made,
      diff_fg2missed = h_fg2missed - a_fg2missed,
      diff_fg2attempted = h_fg2attempted - a_fg2attempted,
      diff_fg3made = h_fg3made - a_fg3made,
      diff_fg3missed = h_fg3missed - a_fg3missed,
      diff_fg3attempted = h_fg3attempted - a_fg3attempted,
      diff_fgmade = h_fgmade - a_fgmade,
      diff_fgmissed = h_fgmissed - a_fgmissed,
      diff_fgattempted = h_fgattempted - a_fgattempted,
      diff_ftmade = h_ftmade - a_ftmade,
      diff_ftmissed = h_ftmissed - a_ftmissed,
      diff_ftattempted = h_ftattempted - a_ftattempted,
      diff_reboffensive = h_reboffensive - a_reboffensive,
      diff_rebdefensive = h_rebdefensive - a_rebdefensive,
      diff_reboundchance = h_reboundchance - a_reboundchance,
      diff_assists = h_assists - a_assists,
      diff_stealsagainst = h_stealsagainst - a_stealsagainst,
      diff_turnovers = h_turnovers - a_turnovers,
      diff_blocksagainst = h_blocksagainst - a_blocksagainst,
      diff_defensivefouls = h_defensivefouls - a_defensivefouls,
      diff_offensivefouls = h_offensivefouls - a_offensivefouls,
      diff_shootingfoulsdrawn = h_shootingfoulsdrawn - a_shootingfoulsdrawn,
      diff_possessions = h_possessions - a_possessions,
      diff_points = h_points - a_points,
      diff_shotattempts = h_shotattempts - a_shotattempts,
      diff_andones = h_andones - a_andones,
      diff_shotattemptpoints = h_shotattemptpoints - a_shotattemptpoints,
      diff_ppa = h_ppa - a_ppa,
      diff_ppp = h_ppp - a_ppp,
      diff_tov_pct = h_tov_pct - a_tov_pct,
      diff_blk_pct = h_blk_pct - a_blk_pct,
      diff_ortg = h_ortg - a_ortg,
      diff_drtg = h_drtg - a_drtg,
      diff_ntrg = h_ntrg - a_ntrg,
      diff_efg_pct = h_efg_pct - a_efg_pct,
      diff_ts_pct = h_ts_pct - a_ts_pct,
      diff_ft_rate = h_ft_rate - a_ft_rate,
      diff_mean_oreb_pct = h_mean_oreb_pct - a_mean_oreb_pct,
      diff_mean_dreb_pct = h_mean_dreb_pct - a_mean_dreb_pct,
      diff_mean_tov_pct = h_mean_tov_pct - a_mean_tov_pct,
      diff_mean_stl_pct = h_mean_stl_pct - a_mean_stl_pct,
      diff_mean_blk_pct = h_mean_blk_pct - a_mean_blk_pct,
      diff_mean_usg_pct = h_mean_usg_pct - a_mean_usg_pct,
      diff_mean_ast_pct = h_mean_ast_pct - a_mean_ast_pct,
      diff_max_usg_pct = h_max_usg_pct - a_max_usg_pct,
      diff_avg_mp_starter = h_avg_mp_starter - a_avg_mp_starter,
      diff_avg_mp_bench = h_avg_mp_bench - a_avg_mp_bench,
      diff_pnts_by_starters = h_pnts_by_starters - a_pnts_by_starters,
      diff_pnts_by_bench = h_pnts_by_bench - a_pnts_by_bench,
      diff_sharp_shooters = h_sharp_shooters - a_sharp_shooters,
      diff_paint_specialists = h_paint_specialists - a_paint_specialists,
      diff_game_score_metric = h_game_score_metric - a_game_score_metric,
    ) %>%
    select(
      conference:a_seed,
      starts_with("diff_"),
      h_cumulative_unique_lineups,
      a_cumulative_unique_lineups
    )
  return(bracket_features)
}

# Function for Conference based playoffs rounds 
run_series <- function(bracket_features,xgb_last) {
  # Create bracket checker
  bracket_checker <- bracket_features %>%
    select(conference:seed_id) %>%
    distinct(conference,round_name,round_number, seed_id) %>%
    mutate(
      winner = NA,
      loser = NA,
      total_games = NA
    )
  
  # Create distinct conference, seed groups
  initial_seed_ids_df <- bracket_features %>%
    distinct(conference,seed_id)
  
  conference_groups <- split(initial_seed_ids_df, initial_seed_ids_df$conference)
  
  # Loop through each conference 
  for(conf in names(conference_groups)) {
    # Extract the current conference data frame
    conference_data <- conference_groups[[conf]]
    
    for(seed in conference_data$seed_id) {
      # Filter rows that match the current series_id
      indv_series <- bracket_features %>%
        filter(seed_id == seed & conference == conf)
      
      u_seed <- indv_series %>% slice_min(order_by = h_seed, n = 1) %>% distinct(h_team) %>% pull(h_team)
      b_seed <- indv_series %>% slice_max(order_by = h_seed, n = 1) %>% distinct(h_team) %>% pull(h_team)
      
      if (length(u_seed) > 1) {
        b_seed <- u_seed[2]
        u_seed <- u_seed[1]
      }
      
      u_seed_wins = 0
      u_seed_losses = 0
      b_seed_wins = 0
      b_seed_losses = 0
      
      for (row_n in 1:nrow(indv_series)) {
        indiv_game = indv_series %>%
          filter(game_number == row_n)
        
        home_team = indiv_game$h_team
        away_team = indiv_game$a_team
        
        pred_winner = predict(
          xgb_last %>% extract_workflow(), 
          new_data = indiv_game,
          type = "prob",
        )
        
        is_home_win = sample(x = c(1, 0), size = 1, replace = TRUE,
                             prob = c(pred_winner$.pred_1, pred_winner$.pred_0))
        
        # Updating the series_tracker based on the game outcome
        if (is_home_win == 1) {
          if (home_team == u_seed)  {
            # Upper seed wins
            u_seed_wins <- u_seed_wins + 1
            b_seed_losses <- b_seed_losses + 1
          } else {
            # Lower seed wins
            b_seed_wins <- b_seed_wins + 1
            u_seed_losses <- u_seed_losses + 1
          }
        } else {
          if (home_team == u_seed)  {
            # Upper seed loses
            b_seed_wins <- b_seed_wins + 1
            u_seed_losses <- u_seed_losses + 1
          } else {
            # Lower seed loses
            u_seed_wins <- u_seed_wins + 1
            b_seed_losses <- b_seed_losses + 1
          }
        }
        
        # Check if either team has won 4 games
        if (u_seed_wins >= 4 || b_seed_wins >= 4) {
          if (u_seed_wins >= 4) {
            # If Upper seed wins they advance
            total_games <- u_seed_wins + u_seed_losses
            bracket_checker$winner[bracket_checker$seed_id == seed & bracket_checker$conference == conf] <- u_seed
            bracket_checker$loser[bracket_checker$seed_id == seed & bracket_checker$conference == conf] <- b_seed
            bracket_checker$total_games[bracket_checker$seed_id == seed & bracket_checker$conference == conf] <- total_games
            
          } else{
            # If Lower seed wins they advance
            total_games <- b_seed_wins + b_seed_losses
            bracket_checker$winner[bracket_checker$seed_id == seed & bracket_checker$conference == conf] <- b_seed
            bracket_checker$loser[bracket_checker$seed_id == seed & bracket_checker$conference == conf] <- u_seed
            bracket_checker$total_games[bracket_checker$seed_id == seed & bracket_checker$conference == conf] <- total_games
            
          }
          break  # Exit the loop
        }
      }
    }
  }
  return(bracket_checker)
}

# Function to ensure that the higher seeded team starts as home team in case of upset
align_bracket_seeding <- function(bracket_with_features) {
  # Determine which match ups need swapping based on the first game
  swap_teams <- bracket_with_features %>%
    filter(game_number == 1) %>%
    mutate(need_swap = h_seed > a_seed) %>%
    select(matchup_id, need_swap)
  
  # Join this back to the original bracket_with_features
  bracket_with_features <- bracket_with_features %>%
    left_join(swap_teams, by = "matchup_id")
  
  bracket_with_features_aligned <- bracket_with_features %>%
    mutate(
      # Swap teams
      h_team_fixed = ifelse(need_swap, a_team, h_team),
      a_team_fixed = ifelse(need_swap, h_team, a_team),
      # Swap seeds
      h_seed_fixed = ifelse(need_swap, a_seed, h_seed),
      a_seed_fixed = ifelse(need_swap, h_seed, a_seed)
    ) %>%
    select(-c(need_swap,a_team, h_team, a_team, h_seed, a_seed)) %>%
    rename(
      "h_team" = h_team_fixed,
      "a_team" = a_team_fixed,
      "h_seed" = h_seed_fixed,
      "a_seed" = a_seed_fixed
    ) %>%
    select(
      conference:game_number,
      h_team,a_team,
      season:gamedate,
      h_seed,a_seed
    )
  
  return(bracket_with_features_aligned)
}

# Function to get all combinations of ECF or WCF conferences
get_cf_potential_matchups <- function() {
  # Define potential winners in the upper and lower brackets
  upper_bracket_winners <- c(1, 4, 5, 8)
  lower_bracket_winners <- c(2, 3, 6, 7)
  
  # Generate all combinations of these winners for the conference finals
  conference_finals_combinations <- expand.grid(upper_bracket = upper_bracket_winners, 
                                                lower_bracket = lower_bracket_winners) %>%
    # Ensure the format "higher seed-lower seed"
    mutate(Conference_Final_Matchup = ifelse(upper_bracket < lower_bracket, 
                                             paste(upper_bracket, lower_bracket, sep = "-"),
                                             paste(lower_bracket, upper_bracket, sep = "-"))) %>%
    # Remove duplicates (as some match ups might repeat with seeds flipping)
    distinct(Conference_Final_Matchup) %>%
    arrange(Conference_Final_Matchup)
  
  return(conference_finals_combinations$Conference_Final_Matchup)
}

# Function run n number of sims using the model and playoff seeding
playoff_sim <- function(sims, xgb_last, playoff_team_seed){
  
  results_list <- list()
  final_series_list <- list()
  
  for (sim_no in 1:sims) { # number of sims to run
    # Get most recent features
    # Ratings
    most_recent_ratings <- hist_ratings %>%
      left_join(
        game_level %>% select(nbagameid,season,gametype),
        by = c("season","nbagameid")
      ) %>%
      filter(season == 2023 & gametype == 2) %>%
      group_by(team) %>%
      top_n(n = 1, wt = nbagameid) %>%
      ungroup() %>%
      select(-season, -nbagameid, -rating_period, -nbagameid_prev, -gametype)
    
    # Rolling features
    most_recent_rolling_features <- rolling_mean_features %>%
      left_join(
        game_level %>% select(nbagameid,season,gametype),
        by = c("season","nbagameid")
      ) %>%
      filter(season == 2023 & gametype == 2) %>%
      group_by(team) %>%
      top_n(n = 1, wt = nbagameid) %>%
      ungroup() %>%
      select(-season, -nbagameid, -gametype, -is_home)
    
    # Player features
    most_recent_player_features <- player_features %>%
      left_join(
        game_level %>% select(nbagameid,season,gametype),
        by = c("season","nbagameid")
      ) %>%
      filter(season == 2023 & gametype == 2) %>%
      group_by(team) %>%
      top_n(n = 1, wt = nbagameid) %>%
      ungroup() %>%
      select(-season, -nbagameid, -gametype)
    
    # Match-ups for 2024 playoff brackets
    # Playoff seeds by team
    playoff_team_seeding <- data.frame(
      team_name = c("BOS", "MIA", "CLE", "ORL", "MIL", "IND", "NYK", "PHI",
                    "OKC", "NOP", "LAC", "DAL", "MIN", "PHX", "DEN", "LAL"),
      seed = c(1, 8, 4, 5, 3, 6, 2, 7,
               1, 8, 4, 5, 3, 6, 2, 7),
      stringsAsFactors = FALSE
    )
    
    # Round 1 initial match ups
    initial_matchups <- list(
      "East" = list("BOS" = "MIA", "CLE" = "ORL", "MIL" = "IND", "NYK" = "PHI"),
      "West" = list("OKC" = "NOP", "LAC" = "DAL", "MIN" = "PHX", "DEN" = "LAL")
    )
    
    # Create the playoff bracket
    initial_playoff_bracket <- create_round_bracket(initial_matchups,1, playoff_team_seed)
    
    # Join in latest features from last regular season games for each team
    initial_bracket_features <- bracket_with_features(initial_playoff_bracket, most_recent_ratings,most_recent_rolling_features,most_recent_player_features) %>%
      # Add match up_id
      mutate(
        matchup_id = pmap_chr(list(h_team, a_team, h_seed, a_seed), ~ {
          teams <- c(..1, ..2)
          seeds <- c(..3, ..4)
          sorted_teams <- teams[order(seeds)]
          paste(sorted_teams, collapse = "-")
        })
      ) %>%
      group_by(matchup_id) %>%
      # Add seed_id
      mutate(
        seed_id = pmap_chr(list(h_seed, a_seed), ~paste(sort(c(...)), collapse = "-"))
      ) %>%
      select(
        conference:a_seed,
        matchup_id,
        seed_id,
        starts_with("diff_"),
        h_cumulative_unique_lineups,
        a_cumulative_unique_lineups
      )
    
    # Initial bracket tracker
    initial_bracket_checker <- run_series(initial_bracket_features,xgb_last)
    
    # Track winners and losers from R1
    e_r1_1_to_8_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "1-8" & initial_bracket_checker$conference == "East"]
    e_r1_1_to_8_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "1-8" & initial_bracket_checker$conference == "East"]
    e_r1_4_to_5_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "4-5" & initial_bracket_checker$conference == "East"]
    e_r1_4_to_5_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "4-5" & initial_bracket_checker$conference == "East"]
    e_r1_3_to_6_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "3-6" & initial_bracket_checker$conference == "East"]
    e_r1_3_to_6_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "3-6" & initial_bracket_checker$conference == "East"]
    e_r1_2_to_7_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "2-7" & initial_bracket_checker$conference == "East"]
    e_r1_2_to_7_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "2-7" & initial_bracket_checker$conference == "East"]
    
    w_r1_1_to_8_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "1-8" & initial_bracket_checker$conference == "West"]
    w_r1_1_to_8_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "1-8" & initial_bracket_checker$conference == "West"]
    w_r1_4_to_5_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "4-5" & initial_bracket_checker$conference == "West"]
    w_r1_4_to_5_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "4-5" & initial_bracket_checker$conference == "West"]
    w_r1_3_to_6_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "3-6" & initial_bracket_checker$conference == "West"]
    w_r1_3_to_6_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "3-6" & initial_bracket_checker$conference == "West"]
    w_r1_2_to_7_winner <- initial_bracket_checker$winner[initial_bracket_checker$seed_id == "2-7" & initial_bracket_checker$conference == "West"]
    w_r1_2_to_7_loser <- initial_bracket_checker$loser[initial_bracket_checker$seed_id == "2-7" & initial_bracket_checker$conference == "West"]
    
    # print("R1 successful")
    
    # Get R2 match ups
    r2_matchups <- list(
      "East" = setNames(list(e_r1_4_to_5_winner, e_r1_3_to_6_winner),
                        c(e_r1_1_to_8_winner, e_r1_2_to_7_winner)),
      "West" = setNames(list(w_r1_4_to_5_winner, w_r1_3_to_6_winner),
                        c(w_r1_1_to_8_winner, w_r1_2_to_7_winner))
    )
    
    # Get R2 bracket
    r2_playoff_bracket <- create_round_bracket(r2_matchups, 2, playoff_team_seed) %>%
      # Add matchup_id for alignment purposes
      mutate(
        matchup_id = pmap_chr(list(h_team, a_team), ~paste(sort(c(...)), collapse = "-"))
      )
    
    # Align bracket before adding features
    r2_playoff_bracket_aligned <- align_bracket_seeding(r2_playoff_bracket)
    
    # Add features to R2 bracket
    r2_playoff_bracket_features <- bracket_with_features(r2_playoff_bracket_aligned, most_recent_ratings,most_recent_rolling_features,most_recent_player_features) %>%
      # Add matchup_id
      mutate(
        matchup_id = pmap_chr(list(h_team, a_team, h_seed, a_seed), ~ {
          teams <- c(..1, ..2)
          seeds <- c(..3, ..4)
          sorted_teams <- teams[order(seeds)]
          paste(sorted_teams, collapse = "-")
        })
      ) %>%
      group_by(matchup_id) %>%
      # Add seed_id
      mutate(
        seed_id = pmap_chr(list(h_seed, a_seed), ~paste(sort(c(...)), collapse = "-"))
      ) %>%
      select(
        conference:a_seed,
        matchup_id,
        seed_id,
        starts_with("diff_"),
        h_cumulative_unique_lineups,
        a_cumulative_unique_lineups
      )
    
    # R2 results
    r2_bracket_checker <- run_series(r2_playoff_bracket_features,xgb_last)
    
    # Track winners and losers from R2
    e_r2_u_winner <- r2_bracket_checker$winner[
      (r2_bracket_checker$seed_id == "1-4" | r2_bracket_checker$seed_id == "1-5" | 
         r2_bracket_checker$seed_id == "4-8" | r2_bracket_checker$seed_id == "5-8") &
        r2_bracket_checker$conference == "East"]
    
    e_r2_u_loser <- r2_bracket_checker$loser[
      (r2_bracket_checker$seed_id == "1-4" | r2_bracket_checker$seed_id == "1-5" | 
         r2_bracket_checker$seed_id == "4-8" | r2_bracket_checker$seed_id == "5-8") &
        r2_bracket_checker$conference == "East"]
    
    e_r2_l_winner <- r2_bracket_checker$winner[
      (r2_bracket_checker$seed_id == "2-3" | r2_bracket_checker$seed_id == "2-6" | 
         r2_bracket_checker$seed_id == "3-7" | r2_bracket_checker$seed_id == "6-7") &
        r2_bracket_checker$conference == "East"]
    
    e_r2_l_loser <- r2_bracket_checker$loser[
      (r2_bracket_checker$seed_id == "2-3" | r2_bracket_checker$seed_id == "2-6" | 
         r2_bracket_checker$seed_id == "3-7" | r2_bracket_checker$seed_id == "6-7") &
        r2_bracket_checker$conference == "East"]
    
    w_r2_u_winner <- r2_bracket_checker$winner[
      (r2_bracket_checker$seed_id == "1-4" | r2_bracket_checker$seed_id == "1-5" | 
         r2_bracket_checker$seed_id == "4-8" | r2_bracket_checker$seed_id == "5-8") &
        r2_bracket_checker$conference == "West"]
    
    w_r2_u_loser <- r2_bracket_checker$loser[
      (r2_bracket_checker$seed_id == "1-4" | r2_bracket_checker$seed_id == "1-5" | 
         r2_bracket_checker$seed_id == "4-8" | r2_bracket_checker$seed_id == "5-8") &
        r2_bracket_checker$conference == "West"]
    
    w_r2_l_winner <- r2_bracket_checker$winner[
      (r2_bracket_checker$seed_id == "2-3" | r2_bracket_checker$seed_id == "2-6" | 
         r2_bracket_checker$seed_id == "3-7" | r2_bracket_checker$seed_id == "6-7") &
        r2_bracket_checker$conference == "West"]
    
    w_r2_l_loser <- r2_bracket_checker$loser[
      (r2_bracket_checker$seed_id == "2-3" | r2_bracket_checker$seed_id == "2-6" | 
         r2_bracket_checker$seed_id == "3-7" | r2_bracket_checker$seed_id == "6-7") &
        r2_bracket_checker$conference == "West"]
    
    # print("R2 successful")
    
    # Get Conference finals match ups
    r3_matchups <- list(
      "East" = setNames(list(e_r2_l_winner),
                        c(e_r2_u_winner)),
      "West" = setNames(list(w_r2_l_winner),
                        c(w_r2_u_winner))
    )
    
    # Get Conference finals bracket
    r3_playoff_bracket <- create_round_bracket(r3_matchups, 3, playoff_team_seed) %>%
      # Add matchup_id for alignment purposes
      mutate(
        matchup_id = pmap_chr(list(h_team, a_team), ~paste(sort(c(...)), collapse = "-"))
      )
    
    # Align seeding for R2 bracket
    r3_playoff_bracket_aligned <- align_bracket_seeding(r3_playoff_bracket)
    
    # Add features to R2 bracket
    r3_playoff_bracket_features <- bracket_with_features(r3_playoff_bracket_aligned, most_recent_ratings,most_recent_rolling_features,most_recent_player_features) %>%
      # Add matchup_id
      mutate(
        matchup_id = pmap_chr(list(h_team, a_team, h_seed, a_seed), ~ {
          teams <- c(..1, ..2)
          seeds <- c(..3, ..4)
          sorted_teams <- teams[order(seeds)]
          paste(sorted_teams, collapse = "-")
        })
      ) %>%
      group_by(matchup_id) %>%
      # Add seed_id
      mutate(
        seed_id = pmap_chr(list(h_seed, a_seed), ~paste(sort(c(...)), collapse = "-"))
      ) %>%
      select(
        conference:a_seed,
        matchup_id,
        seed_id,
        starts_with("diff_"),
        h_cumulative_unique_lineups,
        a_cumulative_unique_lineups
      )
    
    # Conference finals results
    r3_bracket_checker <- run_series(r3_playoff_bracket_features,xgb_last)
    
    # Set R4 match ups
    cf_potential_matchups <- get_cf_potential_matchups()
    
    # Track winners and losers for R4 bracket
    e_r3_winner <- r3_bracket_checker %>%
      filter(seed_id %in% cf_potential_matchups &
               conference == "East") %>%
      pull(winner)
    
    e_r3_loser <- r3_bracket_checker %>%
      filter(seed_id %in% cf_potential_matchups &
               conference == "East") %>%
      pull(loser)
    
    w_r3_winner <- r3_bracket_checker %>%
      filter(seed_id %in% cf_potential_matchups &
               conference == "West") %>%
      pull(winner)
    
    w_r3_loser <- r3_bracket_checker %>%
      filter(seed_id %in% cf_potential_matchups &
               conference == "West") %>%
      pull(loser)
    
    # print("R3 successful")
    
    # Get Finals match ups
    r4_matchups <- list(
      "Both" = setNames(list(e_r3_winner),
                        c(w_r3_winner))
    )
    
    # Get Finals bracket
    r4_playoff_bracket <- create_round_bracket(r4_matchups, 4, playoff_team_seed) %>%
      # Add matchup_id for alignment purposes
      mutate(
        matchup_id = pmap_chr(list(h_team, a_team), ~paste(sort(c(...)), collapse = "-"))
      )
    
    # Align seeding for Finals bracket
    r4_playoff_bracket_aligned <- align_bracket_seeding(r4_playoff_bracket)
    
    # Add features to Finals bracket
    r4_playoff_bracket_features <- bracket_with_features(r4_playoff_bracket_aligned, most_recent_ratings,most_recent_rolling_features,most_recent_player_features) %>%
      # Add matchup_id
      mutate(
        matchup_id = pmap_chr(list(h_team, a_team, h_seed, a_seed), ~ {
          teams <- c(..1, ..2)
          seeds <- c(..3, ..4)
          sorted_teams <- teams[order(seeds)]
          paste(sorted_teams, collapse = "-")
        })
      ) %>%
      group_by(matchup_id) %>%
      # Add seed_id
      mutate(
        seed_id = pmap_chr(list(h_seed, a_seed), ~paste(sort(c(...)), collapse = "-"))
      ) %>%
      select(
        conference:a_seed,
        matchup_id,
        seed_id,
        starts_with("diff_"),
        h_cumulative_unique_lineups,
        a_cumulative_unique_lineups
      )
    
    # Finals results
    r4_bracket_checker <- run_series(r4_playoff_bracket_features,xgb_last)
    
    # Track winners and losers for Final
    f_r4_winner <- r4_bracket_checker %>% 
      filter(conference == "Both") %>%
      pull(winner)
    
    f_r4_loser <- r4_bracket_checker %>% 
      filter(conference == "Both") %>%
      pull(loser)
    
    # print("R4 successful")

    # Add all series results to res list
    results_list[[sim_no]] <- bind_rows(initial_bracket_checker, r2_bracket_checker, r3_bracket_checker, r4_bracket_checker)
    results_list[[sim_no]]$sim_num <- sim_no
    
    final_series_list[[sim_no]] <- data.frame(
      simulation_id = rep(sim_no, 16),
      conference = c(
        "East","East","East","East","West","West","West","West",
        "East","East","West","West",
        "East" , "West",
        "Both",
        "Both"
      ),
      round_made = c(
        1,1,1,1,1,1,1,1,
        2,2,2,2,
        3,3,
        4,
        5
      ),
      team_name = c(
        e_r1_1_to_8_winner,e_r1_4_to_5_winner,e_r1_3_to_6_winner,e_r1_2_to_7_winner,w_r1_1_to_8_winner,w_r1_4_to_5_winner,w_r1_3_to_6_winner,w_r1_2_to_7_winner,
        e_r2_u_winner,e_r2_l_winner,w_r2_u_winner,w_r2_l_winner,
        e_r3_winner,w_r3_winner,
        f_r4_winner,
        f_r4_winner
      )
    )
    # print(paste("Finished Sim:",sim_no))
  }
  
  results <- bind_rows(results_list)
  final_series <- bind_rows(final_series_list)
  
  resultdf <- list('results' = results, 'final_series' = final_series)
  print("All sims complete!")
  
  return(resultdf)
}

# Function to get probability of winning between 2 teams in a given round
get_series_prediction_2024 <- function(round, team1, team2, type = "Point Estimate", playoff_team_seed=playoff_team_seeding, sim_results=all_results) {
  # Set league seeds for both inputted teams
  team1_seed <- playoff_team_seed %>% filter(team_name == team1) %>% pull(league_seed)
  team2_seed <-playoff_team_seed %>% filter(team_name == team2) %>% pull(league_seed)
  
  # Get match up in format as sim results
  if (team1_seed < team2_seed) {
    matchup = paste0(team1,'-',team2)
  } else {
    matchup = paste0(team2,'-',team1)
  }
  
  # Adjust round text output for Finals series
  if (round == "Finals") {
    round_txt = "the Finals"
  } else {
    round_txt = round
  }
  
  if (type == "Point Estimate") {
    title <- "Series Win- Point Estimate"
    
    # Get probability of each team winning overall across rounds 
    results_summary <- sim_results %>%
      filter(
        matchup_id == matchup
        & round_name == round
      ) %>%
      group_by(matchup_id, round_number, round_name,winner) %>%
      summarise(
        win_count = n(), # Count the number of times each team has won
        avg_total_games = mean(total_games), # Average number of total games played
        .groups = 'drop'
      ) %>%
      group_by(matchup_id, round_number) %>%
      mutate(
        win_pct = win_count / sum(win_count) # Calculate win percentage
      ) %>%
      ungroup() %>%
      # Join the logo URLs with the main data frame
      left_join(logo_mapping, by = c("winner" = "team_name")) %>%
      # Drop unnecessary cols
      select(logo_url, win_pct, avg_total_games)
    
    # Create the table using gt
    res_table <- gt(results_summary) %>%
      tab_header(
        title = title,
        subtitle = paste("Simulated",round,"series games between", team1, "and", team2)
      ) %>%
      cols_label(
        logo_url = "Team",
        win_pct = "Win %",
        avg_total_games = "Avg # of Games",
      ) %>%
      fmt_percent(
        columns = c(win_pct),
        decimals = 1
      ) %>%
      fmt_number(
        columns = c(avg_total_games),
        decimals = 1
      ) %>%
      tab_options(table.width = pct(40)) %>%
      gt_img_rows(logo_url) %>%
      tab_source_note("The % chance that a team wins a series in a given, average number of games.") %>%
      gt_theme_538() %>%
      tab_style(
        style = cell_text(align = 'center'),
        locations = cells_column_labels(columns = everything())
      ) %>%
      tab_style(
        style = cell_text(align = 'center'),
        locations = cells_body(columns = everything())
      )
  } else {
    # Get probability of each team winning by n games
    title <- "Series Win- Probabilistic"
    
    results_summary <- sim_results %>%
      filter(
        matchup_id == matchup
        & round_name == round
      ) %>%
      group_by(matchup_id, round_number, round_name,total_games,winner) %>%
      summarise(
        win_count = n(), # Count the number of times each team has won
        .groups = 'drop'
      ) %>%
      group_by(matchup_id, round_number,total_games) %>%
      mutate(
        win_pct = win_count / sum(win_count) # Calculate win percentage
      ) %>%
      ungroup() %>%
      # Join the logo URLs with the main data frame
      left_join(logo_mapping, by = c("winner" = "team_name")) %>%
      # Drop unnecessary cols
      select(logo_url, win_pct,total_games)
      
    # Create the table using gt
    res_table <- gt(results_summary) %>%
      tab_header(
        title = title,
        subtitle = paste("Simulated",round,"series games between", team1, "and", team2)
      ) %>%
      cols_label(
        logo_url = "Team",
        win_pct = "Win %",
        total_games = "# of Games",
      ) %>%
      fmt_percent(
        columns = c(win_pct),
        decimals = 1
      ) %>%
      tab_options(table.width = pct(40)) %>%
      gt_img_rows(logo_url) %>%
      tab_source_note("The % chance that a team wins a series when playing a given total number of games.") %>%
      gt_theme_538() %>%
      tab_style(
        style = cell_text(align = 'center'),
        locations = cells_column_labels(columns = everything())
      ) %>%
      tab_style(
        style = cell_text(align = 'center'),
        locations = cells_body(columns = everything())
      )
  }
  return(res_table)
}

Static Variables

# Static variables ---------------------------

# Logo URLs mapped to team names
logo_mapping <- data.frame(
  team_name = c("BOS", "MIA", "CLE", "ORL", "MIL", "IND", "NYK", "PHI", "OKC", "NOP", "LAC", "DAL", "MIN", "PHX", "DEN", "LAL"),
  logo_url = c(
    "https://content.sportslogos.net/logos/6/213/thumbs/slhg02hbef3j1ov4lsnwyol5o.gif",
    "https://content.sportslogos.net/logos/6/214/thumbs/burm5gh2wvjti3xhei5h16k8e.gif",
    "https://content.sportslogos.net/logos/6/222/thumbs/22253692023.gif",
    "https://content.sportslogos.net/logos/6/217/thumbs/wd9ic7qafgfb0yxs7tem7n5g4.gif",
    "https://content.sportslogos.net/logos/6/225/thumbs/22582752016.gif",
    "https://content.sportslogos.net/logos/6/224/thumbs/22448122018.gif",
    "https://content.sportslogos.net/logos/6/216/thumbs/21671702024.gif",
    "https://content.sportslogos.net/logos/6/218/thumbs/21870342016.gif",
    "https://content.sportslogos.net/logos/6/2687/thumbs/khmovcnezy06c3nm05ccn0oj2.gif",
    "https://content.sportslogos.net/logos/6/4962/thumbs/496292922024.gif",
    "https://content.sportslogos.net/logos/6/236/thumbs/23655422025.gif",
    "https://content.sportslogos.net/logos/6/228/thumbs/22834632018.gif",
    "https://content.sportslogos.net/logos/6/232/thumbs/23296692018.gif",
    "https://content.sportslogos.net/logos/6/238/thumbs/23843702014.gif",
    "https://content.sportslogos.net/logos/6/229/thumbs/22989262019.gif",
    "https://content.sportslogos.net/logos/6/237/thumbs/23773242024.gif"
  )
)

# Seeding for playoffs
playoff_team_seeding <- data.frame(
  team_name = c("BOS", "MIA", "CLE", "ORL", "MIL", "IND", "NYK", "PHI",
                "OKC", "NOP", "LAC", "DAL", "MIN", "PHX", "DEN", "LAL"),
  seed = c(
    1, 8, 4, 5, 3, 6, 2, 7,
    1, 8, 4, 5, 3, 6, 2, 7
  ),
  league_seed = c(
    1, 16, 11, 12, 8, 15, 7, 14,
    2, 9, 5, 6, 4, 10, 3, 13
  ),
  stringsAsFactors = FALSE
)

Initial Setup

# Initial setup ---------------------------

# Get game level data
game_level <- team_data %>%
  filter(season >= 2014 & off_home == 1) %>%
  arrange(season, gamedate, nbagameid) %>%
  mutate(gamedate = as.Date(gamedate)) %>%
  select(season:gamedate,off_team,off_win,fg2made:shotattemptpoints) %>%
  rename_with(~ paste0("h_", .), fg2made:shotattemptpoints) %>%
  rename("h_team" = off_team, "is_win" = off_win) %>%
  inner_join(team_data %>%
               filter(season >= 2014 & off_home == 0) %>%
               arrange(season, gamedate, nbagameid) %>%
               mutate(gamedate = as.Date(gamedate)) %>%
               select(season,nbagameid,off_team,fg2made:shotattemptpoints) %>%
               rename_with(~ paste0("a_", .), fg2made:shotattemptpoints) %>%
               rename("a_team" = off_team),
             by = c("season","nbagameid")
  ) %>%
  select(season:h_team,a_team,is_win,h_fg2made:h_shotattemptpoints,a_fg2made:a_shotattemptpoints)

Feature Development

Team Features

# Team Features ---------------------------

# Advanced Box score Metrics
game_level <- game_level %>%
  # Offensive advanced team stats
  mutate(
    h_ppa = h_shotattemptpoints/h_shotattempts, # Points per attempt
    h_ppp = h_shotattemptpoints/h_possessions, # Points per possession
    h_tov_pct = h_turnovers/(h_shotattempts + h_turnovers), # Turnover %
    h_blk_pct = a_blocksagainst/a_fg2attempted, # Block %
    h_ortg = h_points/(h_possessions/100), # Offensive Rating
    h_drtg = a_points/(a_possessions/100), # Defensive Rating
    h_ntrg = h_ortg - h_drtg, # Net Rating
    h_efg_pct = (h_fgmade + (0.5 * h_fg3made)) / (h_fgattempted * 100), # Effective Field Goal %
    h_ts_pct = h_points / (2 * (h_fgattempted + .475 * h_ftattempted)), # True Shooting %
    h_ft_rate = h_ftmade / h_fgattempted, # Free Throw Rate
  ) %>%
  # Defensive advanced team stats
  mutate(
    a_ppa = a_shotattemptpoints/a_shotattempts, # Points per attempt
    a_ppp = a_shotattemptpoints/a_possessions, # Points per possession
    a_tov_pct = a_turnovers/(a_shotattempts + a_turnovers), # Turnover %
    a_blk_pct = h_blocksagainst/h_fg2attempted, # Block %
    a_ortg = a_points/(a_possessions/100), # Offensive Rating
    a_drtg = h_points/(h_possessions/100), # Defensive Rating
    a_ntrg = a_ortg - a_drtg, # Net Rating
    a_efg_pct = (a_fgmade + (0.5 * a_fg3made)) / (a_fgattempted * 100), # Effective Field Goal %
    a_ts_pct = a_points / (2 * (a_fgattempted + .475 * a_ftattempted)), # True Shooting %
    a_ft_rate = a_ftmade / a_fgattempted, # Free Throw Rate
  )

# Rolling Averages
# Need to convert back to origin 2 row per match df structure
team_level <- game_level %>%
  mutate(h_is_home = 1) %>%
  select(season,nbagameid, gamedate, h_team, h_is_home, h_fg2made:h_shotattemptpoints, h_ppa:h_ft_rate) %>%
  rename_with(~ str_remove_all(., "h_"), h_team:h_ft_rate) %>%
  bind_rows (
    game_level %>% 
      mutate(a_is_home = 0) %>%
      select(season,nbagameid, gamedate, a_team, a_is_home, a_fg2made:a_shotattemptpoints, a_ppa:a_ft_rate) %>%
      rename_with(~ str_remove_all(., "a_"), a_team:a_ft_rate)
  ) %>%
  arrange(season, nbagameid)

# Get rolling avg for box score and advanced stats
rolling_mean_features <- team_level %>%
  mutate_at(
    vars(fg2made:ft_rate), # Columns for which we want a rolling mean
    .funs = ~ roll_mean(., 5, align = "right", fill = NA) # Rolling mean for last 5 games 
  ) %>%
  ungroup() %>%
  select(season, nbagameid, team, is_home, fg2made:ft_rate) %>%
  filter(!is.na(fg2made))


# Time/Date Features
# Calculate days since last game and days until next game
days_since_stats <- team_level %>%
  select(season, nbagameid, gamedate, team) %>%
  arrange(season, team, gamedate) %>% # Arrange by season, team, and date
  group_by(season, team) %>% # Group by season and team
  mutate(
    days_since_last_game = c(0, diff(gamedate)),  # Calculate days since last game
    days_until_next_game = as.integer(lead(gamedate) - gamedate) # Calculate days until next game
  ) %>%
  mutate(
    # Reset the days since last game for the first game of each season
    days_since_last_game = if_else(row_number() == 1, NA, days_since_last_game),
    days_until_next_game = if_else(row_number() == n(), NA, as.integer(days_until_next_game))
  ) %>%
  ungroup()

Player Features

# Player Features ---------------------------

# Get player level game data
player_level <- player_data %>%
  filter(season >= 2014) %>%
  mutate(gamedate = as.Date(gamedate)) %>%
  arrange(season, gamedate, nbagameid, nbateamid)

# Get number of players injured per team per match
player_features <- player_level %>%
  # Calculate various player-specific percentages and metrics
  mutate(
    oreb_pct = reboffensive / offensivereboundchances,  # Offensive rebound percentage
    dreb_pct = rebdefensive / defensivereboundchances,  # Defensive rebound percentage
    tov_pct = turnovers / (fgattempted + turnovers),  # Turnover percentage
    stl_pct = replace(steals / defensivepossessions, is.infinite(steals / defensivepossessions), NA),  # Steal percentage, handling infinite values
    blk_pct = replace(blocks / opponentteamfg2attempted, is.infinite(blocks / opponentteamfg2attempted), NA),  # Block percentage, handling infinite values
    usg_pct = (shotattempts + turnovers) / (teamshotattempts + teamturnovers),  # Usage percentage
    ast_pct = assists / (teamfgmade - (fg3made + fg2made)),  # Assist percentage
    pnt3_pct = fg3made / fg3attempted,  # 3-point success percentage
    pnt2_pct = fg2made / fg2attempted,  # 2-point success percentage
    h_ast_pct = assists / (fgattempted + (0.475 * (ftattempted + assists + turnovers))),  # Hybrid assist percentage
    game_score_metric = points + (0.4 * fgmade) - (0.7 * fgattempted) - (0.4 * (ftattempted - ftmade)) + (0.7 * reboffensive) + (0.3 * rebdefensive) + steals + (0.7 * assists) + (0.7 * blocks) - (0.4 * ((defensivefouls + offensivefouls) - turnovers))  # Game score metric calculation
  ) %>%
  # Group by season, game, and team for summary statistics
  group_by(season, nbagameid, team) %>%
  summarise(
    mean_oreb_pct = mean(oreb_pct, na.rm = TRUE),
    mean_dreb_pct = mean(dreb_pct, na.rm = TRUE),
    mean_tov_pct = mean(tov_pct, na.rm = TRUE),
    mean_stl_pct = mean(stl_pct, na.rm = TRUE),
    mean_blk_pct = mean(blk_pct, na.rm = TRUE),
    mean_usg_pct = mean(usg_pct, na.rm = TRUE),
    mean_ast_pct = mean(ast_pct, na.rm = TRUE),
    max_usg_pct = max(usg_pct, na.rm = TRUE), # Max usage % as a proxy for teams reliant on star players for success
    inj_players = sum(missed),  # Total injured players
    avg_mp_starter = mean(seconds[starter == 1], na.rm = TRUE) / 60,  # Average minutes played by starters
    avg_mp_bench = mean(seconds[starter == 0], na.rm = TRUE) / 60,  # Average minutes played by bench players
    pnts_by_starters = sum(points[starter == 1], na.rm = TRUE),  # Points by starters
    pnts_by_bench = sum(points[starter == 0], na.rm = TRUE),  # Points by bench
    sharp_shooters = sum(pnt3_pct > 0.35, na.rm = TRUE),  # Count of sharp shooters
    paint_specialists = sum(pnt2_pct > 0.50, na.rm = TRUE),  # Count of paint specialists
    game_score_metric = mean(game_score_metric, na.rm = TRUE),
    .groups = 'drop'
  ) %>%
  # Join with data to track unique lineups over time
  inner_join(
    player_level %>%
      filter(starter == 1) %>%
      arrange(season, team, nbagameid, nbapersonid) %>%
      group_by(season, team, nbagameid) %>%
      summarise(lineup = paste(nbapersonid, collapse = "-"), .groups = 'drop') %>%
      ungroup() %>%
      group_by(season, team) %>%
      arrange(season, team, nbagameid) %>%
      # Track unique lineups by cumulative count of first occurrences
      mutate(
        cumulative_unique_lineups = cumsum(!duplicated(lineup))
      ) %>%
      ungroup() %>%
      select(-lineup),
    by = c("season", "nbagameid", "team")
  )

Team Strength Feature

# Team Strength Features ---------------------------

# Build and run the glicko-2. rating system with set parameters
glicko2_model <- glicko2(
  game_level %>% arrange(season,nbagameid) %>% mutate(nbagameid = row_number()) %>% select(nbagameid,h_team,a_team,is_win),
  status = NULL,
  init = c(2200,250,0.03),
  tau = 1.2,
  history = TRUE
)

# Get historical ratings for each game in training data
hist_ratings <- glicko2_model[2] %>%
  # Convert the matrix to a data frame
  as.data.frame() %>%
  # Add row names as a column for team names
  rownames_to_column(var = "team") %>%
  # Pivot data longer to transform the data from wide to long format
  pivot_longer(
    cols = -team,
    names_to = "rating_period",
    values_to = "rating"
  ) %>%
  # Filter columns that end with '.Lag'
  filter(endsWith(rating_period, ".Lag")) %>%
  # Extract numbers from 'rating_period' strings
  mutate(rating_period = str_extract(rating_period, "\\d+")) %>%
  # Temporarily rename the 'rating' column for lag identification
  rename(is_lag = rating) %>%
  # Join with the main ratings from the Glicko2 model
  left_join(
    glicko2_model[2] %>%
      as.data.frame() %>%
      # Add row names as a column for team names
      rownames_to_column(var = "team") %>%
      # Pivot data longer to transform the data from wide to long format
      pivot_longer(
        cols = -team,
        names_to = "rating_period",
        values_to = "rating"
      ) %>%
      # Filter columns that end with '.Rating'
      filter(endsWith(rating_period, ".Rating")) %>%
      # Extract numbers from 'rating_period' strings
      mutate(rating_period = str_extract(rating_period, "\\d+")),
    by = c("team", "rating_period")
  ) %>%
  # Filter for entries where 'is_lag' is zero and 'rating' is not the initial value (2200)
  filter(is_lag == 0 & rating != 2200) %>%
  # Convert 'rating_period' to numeric for sorting
  mutate(rating_period = as.numeric(rating_period)) %>%
  # Remove the 'is_lag' column
  select(-is_lag) %>%
  # Arrange by 'rating_period' to ensure chronological order
  arrange(rating_period) %>%
  # Map game and season IDs from another data set
  mutate(
    nbagameid = team_level$nbagameid,
    season = team_level$season
  ) %>%
  # Group by 'season' and 'team' to handle game-level data
  group_by(season, team) %>%
  arrange(nbagameid) %>%
  # Create a lagged 'nbagameid' to link ratings to specific games
  mutate(nbagameid_prev = lag(nbagameid)) %>%
  ungroup() %>%
  # Filter out any missing values in 'nbagameid_prev'
  filter(!is.na(nbagameid_prev))

Combine Features

# Combine features ---------------------------

# Last game look-up helper
last_game_lookup <- team_level %>%
  rename(team = team) %>%
  group_by(season, team) %>%
  arrange(nbagameid) %>%
  mutate(nbagameid_prev = lag(nbagameid)) %>%
  select(season, team, nbagameid, nbagameid_prev) %>% 
  filter(!is.na(nbagameid_prev))

# Preparing the features data frame by joining game data with historical team ratings and player statistics
features <- game_level %>%
  select(season:is_win) %>%
  arrange(season,nbagameid) %>%
  # Join in last game look-up df
  inner_join( 
    last_game_lookup,
    by = c(
      "nbagameid" = "nbagameid",
      "h_team" = "team",
      "season" = "season"
    )
  ) %>%
  rename("h_nbagameid_prev" = nbagameid_prev) %>%
  inner_join(
    last_game_lookup,
    by = c(
      "nbagameid" = "nbagameid",
      "a_team" = "team",
      "season" = "season"
    )
  ) %>%
  rename("a_nbagameid_prev" = nbagameid_prev) %>%
  # Join in rating system feature
  inner_join(
    hist_ratings %>% arrange(rating_period),
    by = c(
      "season" = "season",
      "h_nbagameid_prev" = "nbagameid",
      "h_team" = "team"
    )
  ) %>%
  rename("h_rating" = rating) %>%
  inner_join(
    hist_ratings %>% arrange(rating_period),
    by = c(
      "season" = "season",
      "a_nbagameid_prev" = "nbagameid",
      "a_team" = "team"
    )
  ) %>%
  rename("a_rating" = rating) %>%
  # Join in rolling mean features
  inner_join(
    rolling_mean_features %>% select(-is_home),
    by = c(
      "season" = "season",
      "h_nbagameid_prev" = "nbagameid",
      "h_team" = "team"
    )
  ) %>%
  rename_with(~ paste0("h_", .), fg2made:ft_rate) %>%
  inner_join(
    rolling_mean_features %>% select(-is_home),
    by = c(
      "season" = "season",
      "a_nbagameid_prev" = "nbagameid",
      "a_team" = "team"
    )
  ) %>%
  rename_with(~ paste0("a_", .), fg2made:ft_rate) %>%
  # Join in player features
  inner_join(
    player_features,
    by = c(
      "season" = "season",
      "h_nbagameid_prev" = "nbagameid",
      "h_team" = "team"
    )
  ) %>%
  rename_with(~ paste0("h_", .), mean_oreb_pct:cumulative_unique_lineups) %>%
  inner_join(
    player_features,
    by = c(
      "season" = "season",
      "a_nbagameid_prev" = "nbagameid",
      "a_team" = "team"
    )
  ) %>%
  rename_with(~ paste0("a_", .), mean_oreb_pct:cumulative_unique_lineups) %>%
  # Reduce number of features by finding difference between home and away teams
  mutate(
    diff_rating = h_rating - a_rating,
    diff_fg2made = h_fg2made - a_fg2made,
    diff_fg2missed = h_fg2missed - a_fg2missed,
    diff_fg2attempted = h_fg2attempted - a_fg2attempted,
    diff_fg3made = h_fg3made - a_fg3made,
    diff_fg3missed = h_fg3missed - a_fg3missed,
    diff_fg3attempted = h_fg3attempted - a_fg3attempted,
    diff_fgmade = h_fgmade - a_fgmade,
    diff_fgmissed = h_fgmissed - a_fgmissed,
    diff_fgattempted = h_fgattempted - a_fgattempted,
    diff_ftmade = h_ftmade - a_ftmade,
    diff_ftmissed = h_ftmissed - a_ftmissed,
    diff_ftattempted = h_ftattempted - a_ftattempted,
    diff_reboffensive = h_reboffensive - a_reboffensive,
    diff_rebdefensive = h_rebdefensive - a_rebdefensive,
    diff_reboundchance = h_reboundchance - a_reboundchance,
    diff_assists = h_assists - a_assists,
    diff_stealsagainst = h_stealsagainst - a_stealsagainst,
    diff_turnovers = h_turnovers - a_turnovers,
    diff_blocksagainst = h_blocksagainst - a_blocksagainst,
    diff_defensivefouls = h_defensivefouls - a_defensivefouls,
    diff_offensivefouls = h_offensivefouls - a_offensivefouls,
    diff_shootingfoulsdrawn = h_shootingfoulsdrawn - a_shootingfoulsdrawn,
    diff_possessions = h_possessions - a_possessions,
    diff_points = h_points - a_points,
    diff_shotattempts = h_shotattempts - a_shotattempts,
    diff_andones = h_andones - a_andones,
    diff_shotattemptpoints = h_shotattemptpoints - a_shotattemptpoints,
    diff_ppa = h_ppa - a_ppa,
    diff_ppp = h_ppp - a_ppp,
    diff_tov_pct = h_tov_pct - a_tov_pct,
    diff_blk_pct = h_blk_pct - a_blk_pct,
    diff_ortg = h_ortg - a_ortg,
    diff_drtg = h_drtg - a_drtg,
    diff_ntrg = h_ntrg - a_ntrg,
    diff_efg_pct = h_efg_pct - a_efg_pct,
    diff_ts_pct = h_ts_pct - a_ts_pct,
    diff_ft_rate = h_ft_rate - a_ft_rate,
    diff_mean_oreb_pct = h_mean_oreb_pct - a_mean_oreb_pct,
    diff_mean_dreb_pct = h_mean_dreb_pct - a_mean_dreb_pct,
    diff_mean_tov_pct = h_mean_tov_pct - a_mean_tov_pct,
    diff_mean_stl_pct = h_mean_stl_pct - a_mean_stl_pct,
    diff_mean_blk_pct = h_mean_blk_pct - a_mean_blk_pct,
    diff_mean_usg_pct = h_mean_usg_pct - a_mean_usg_pct,
    diff_mean_ast_pct = h_mean_ast_pct - a_mean_ast_pct,
    diff_max_usg_pct = h_max_usg_pct - a_max_usg_pct,
    diff_avg_mp_starter = h_avg_mp_starter - a_avg_mp_starter,
    diff_avg_mp_bench = h_avg_mp_bench - a_avg_mp_bench,
    diff_pnts_by_starters = h_pnts_by_starters - a_pnts_by_starters,
    diff_pnts_by_bench = h_pnts_by_bench - a_pnts_by_bench,
    diff_sharp_shooters = h_sharp_shooters - a_sharp_shooters,
    diff_paint_specialists = h_paint_specialists - a_paint_specialists,
    diff_game_score_metric = h_game_score_metric - a_game_score_metric,
  ) %>%
  # Select relevant columns
  select(
    season,
    nbagameid,
    gamedate,
    h_team,
    a_team,
    is_win,
    starts_with("diff_"),
    h_cumulative_unique_lineups,
    a_cumulative_unique_lineups
  ) %>%
  mutate(
    is_win = as.factor(is_win)
  )

Feature EDA

# Feature EDA ---------------------------

# Box score metrics P1
features %>%
  mutate(is_win = ifelse(is_win == 1,"Win","Loss")) %>%
  pivot_longer(diff_fg2made:diff_reboundchance, names_to = "stat", values_to = "value") %>%
  ggplot(aes(is_win, value, color = is_win)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap(~stat, scales = "free_y", nrow = 4, ncol = 4) +
  labs(y = NULL, color = NULL, fill = NULL)

There are no distinct differences in game outcome across these metrics. Intuition tells us that these metrics will not be good discriminators in our model.

# Box score metrics P2
features %>%
  mutate(is_win = ifelse(is_win == 1,"Win","Loss")) %>%
  pivot_longer(diff_assists:diff_shotattemptpoints, names_to = "stat", values_to = "value") %>%
  ggplot(aes(is_win, value, color = is_win)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap(~stat, scales = "free_y", nrow = 3, ncol = 6) +
  labs(y = NULL, color = NULL, fill = NULL)

diff_blocksagainst is the only metric with a clear visual difference between game outcomes. Again, the difference box score metrics appear to mot be good discriminators of wins and losses.

# Advanced team and player metrics
features %>%
  mutate(is_win = ifelse(is_win == 1,"Win","Loss")) %>%
  pivot_longer(diff_ppa:diff_max_usg_pct, names_to = "stat", values_to = "value") %>%
  ggplot(aes(is_win, value, color = is_win)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap(~stat, scales = "free_y", nrow = 3, ncol = 6) +
  labs(y = NULL, color = NULL, fill = NULL)

There is no clear differences across most of these metrics except for diff_max_usg_pct and diff_ntrg. It appears that these advanced metrics overall will not be good discriminators of wins and loses in our model.

# Rating feature
features %>%
  mutate(is_win = ifelse(is_win == 1,"Win","Loss")) %>%
  pivot_longer(diff_rating, names_to = "stat", values_to = "value") %>%
  ggplot(aes(is_win, value, color = is_win)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap(~stat, scales = "free_y", nrow = 1) +
  labs(y = NULL, color = NULL, fill = NULL)

We can see the biggest difference in game outcome in the diff_rating metric. This appears to be our strongest discriminator of wins and loses thus far.

# Cumulative lineup features
features %>%
  mutate(is_win = ifelse(is_win == 1,"Win","Loss")) %>%
  pivot_longer(h_cumulative_unique_lineups:a_cumulative_unique_lineups, names_to = "stat", values_to = "value") %>%
  ggplot(aes(is_win, value, color = is_win)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap(~stat, scales = "free_y", nrow = 2) +
  labs(y = NULL, color = NULL, fill = NULL)

We can see clear differences in game outcome between both a_cumulative_unique_lineups and h_cumulative_unique_lineups. This is an early indication that the number of unique starting lineups cumulative across seasons is a good discriminator of wins and loses for our model, but diff_rating is still on top.

# Misc Features
features %>%
  mutate(is_win = ifelse(is_win == 1,"Win","Loss")) %>%
  pivot_longer(diff_avg_mp_bench:diff_game_score_metric, names_to = "stat", values_to = "value") %>%
  ggplot(aes(is_win, value, color = is_win)) +
  geom_boxplot(alpha = 0.4) +
  facet_wrap(~stat, scales = "free_y", nrow = 3) +
  labs(y = NULL, color = NULL, fill = NULL)

We can see clear differences in game outcome between diff_game_score_metric and diff_paint_specialists but less so in diff_pnts_by_starters and diff_pnts_by_bench.

Model Development

Model Preparation: Workflow/Specification

# Model Preparation ---------------------------

# Create Splits (80-20)
splits <- initial_split(
  features,
  prop = 0.8
)

# Create pre-processing recipe
preprocessing_recipe <-
  recipe(is_win ~ ., data = splits %>% training()) %>%
  # Removes unnecessary columns
  step_rm(season, nbagameid, gamedate, h_team, a_team) %>%
  # Removes observations (rows of data) if they contain NA or NaN values
  step_naomit(everything(), skip= TRUE) %>%
  # Removes any numeric variables that have zero variance
  step_zv(all_numeric(), -all_outcomes()) %>%
  # Remove highly correlated variables
  step_corr(all_numeric(), threshold = 0.8, method = "spearman") 

# Observe the recipe on features
features_proprocessed <- preprocessing_recipe %>%
  prep() %>%
  bake(splits %>% training())

# Set Seed for reproducibility 
set.seed(123)
feature_folds <- vfold_cv(training(splits), strata = is_win, v = 5)

# Create XGB boost classification model spec
xgb_spec <- boost_tree(
  mode = "classification",
  trees = 500,
  tree_depth = tune(), min_n = tune(),
  loss_reduction = tune(),                     # first three: model complexity
  sample_size = tune(), mtry = tune(),         # randomness
  learn_rate = tune()                          # step size
) %>%
  set_engine("xgboost")

# Display model specification
xgb_spec
## Boosted Tree Model Specification (classification)
## 
## Main Arguments:
##   mtry = tune()
##   trees = 500
##   min_n = tune()
##   tree_depth = tune()
##   learn_rate = tune()
##   loss_reduction = tune()
##   sample_size = tune()
## 
## Computational engine: xgboost
# Create model workflow
xgb_wf <- workflow() %>%
  add_recipe(preprocessing_recipe) %>%
  add_model(xgb_spec)

Tune and Train Model

# Hyper-parameter tuning ---------------------------

# Use anova race to tune the grid and save time on poor performing parameter combinations
doParallel::registerDoParallel()

set.seed(345)
xgb_res <- tune_race_anova(
  xgb_wf,
  resamples = feature_folds,
  grid = 5,#30,
  metrics = metric_set(roc_auc),
  control = control_race(verbose_elim = TRUE,save_pred=TRUE)
)
## ℹ Evaluating against the initial 3 burn-in resamples.
## i Creating pre-processing data to finalize unknown parameter: mtry
## 
## ℹ Racing will maximize the roc_auc metric.
## ℹ Resamples are analyzed in a random order.
## ℹ Fold3: 2 eliminated; 3 candidates remain.
## 
## ℹ Fold5: 0 eliminated; 3 candidates remain.
# Plot the parameter combination race
plot_race(xgb_res)

Using tune_race_anova we can eliminated combinations of parameters that are low performing and only use compute on parameter combinations that are high performing. We can see that only 3 parameter combinations made it to the 5th and final race stage.

# Collect metrics for the model training 
collect_metrics(xgb_res)
## # A tibble: 3 × 12
##    mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
##   <int> <int>      <int>      <dbl>          <dbl>       <dbl> <chr>  
## 1    20     9          2    0.00225       0.000515       0.353 roc_auc
## 2    32    37         12    0.00389       0.00643        0.586 roc_auc
## 3    14    23         13    0.0168        8.39           0.172 roc_auc
## # ℹ 5 more variables: .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>,
## #   .config <chr>
# Show best combination of parameters
show_best(xgb_res, metric = "roc_auc")
## # A tibble: 3 × 12
##    mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
##   <int> <int>      <int>      <dbl>          <dbl>       <dbl> <chr>  
## 1    20     9          2    0.00225       0.000515       0.353 roc_auc
## 2    32    37         12    0.00389       0.00643        0.586 roc_auc
## 3    14    23         13    0.0168        8.39           0.172 roc_auc
## # ℹ 5 more variables: .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>,
## #   .config <chr>

Refit & Evaluate Model

# Refit best model on training data and assess performance on test set
xgb_last <-
  xgb_wf %>%
  finalize_workflow(select_best(xgb_res,metric = "roc_auc")) %>%
  last_fit(splits)

# Show metrics
collect_metrics(xgb_last)
## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.644 Preprocessor1_Model1
## 2 roc_auc     binary         0.688 Preprocessor1_Model1
## 3 brier_class binary         0.225 Preprocessor1_Model1
# Capture training predictions
xgb_last_pred <- collect_predictions(xgb_last)

# Display output
xgb_last_pred
## # A tibble: 2,481 × 7
##    .pred_class .pred_0 .pred_1 id                .row is_win .config            
##    <fct>         <dbl>   <dbl> <chr>            <int> <fct>  <chr>              
##  1 1             0.459   0.541 train/test split     6 1      Preprocessor1_Mode…
##  2 1             0.365   0.635 train/test split    10 1      Preprocessor1_Mode…
##  3 0             0.501   0.499 train/test split    20 1      Preprocessor1_Mode…
##  4 1             0.462   0.538 train/test split    29 1      Preprocessor1_Mode…
##  5 0             0.557   0.443 train/test split    30 0      Preprocessor1_Mode…
##  6 1             0.392   0.608 train/test split    33 1      Preprocessor1_Mode…
##  7 1             0.437   0.563 train/test split    46 0      Preprocessor1_Mode…
##  8 1             0.432   0.568 train/test split    49 1      Preprocessor1_Mode…
##  9 0             0.553   0.447 train/test split    63 1      Preprocessor1_Mode…
## 10 1             0.371   0.629 train/test split    66 1      Preprocessor1_Mode…
## # ℹ 2,471 more rows
# Extract variable importance and plot
xgb_fit <- extract_fit_parsnip(xgb_last)
vip(xgb_fit, num_features = 15)

Plotting variable importance allows us to quantify how much a given feature in our model is explaining game outcome. In this case our diff_rating variable is explaining our predictor the most, however also Hollinger’s game score metric and the maximum usage % player percentage on the team were also important variables. The cumulative number of lineups for the home and difference in points by starters between teams were less important but appear in the top 5 for variable importance in our model.

# Evaluate ROC curve
xgb_last %>%
  collect_predictions() %>%
  roc_curve(is_win, .pred_1) %>%
  ggplot(aes(x = 1 - specificity, y = sensitivity)) +
  geom_line(linewidth = 1.5, color = "midnightblue") +
  geom_abline(
    lty = 2, alpha = 0.5,
    color = "gray50",
    linewidth = 1.2
  )

ROC curves allow us to assess the classification performance of our model i.e. how well we predict game outcomes (as wins). It uses a graphical representation of two variables, sensitivity (true positive rate) and 1- specificity (false positive rate). We can see that our model was able to correctly identify when the predicted team will win and not too often incorrectly predicting a loss when the outcome was a win.

Playoff Simulation

Run Simulations

# 2024 Playoff Simulation ---------------------------

# Set playoff seeding by conference and by league
playoff_team_seeding
##    team_name seed league_seed
## 1        BOS    1           1
## 2        MIA    8          16
## 3        CLE    4          11
## 4        ORL    5          12
## 5        MIL    3           8
## 6        IND    6          15
## 7        NYK    2           7
## 8        PHI    7          14
## 9        OKC    1           2
## 10       NOP    8           9
## 11       LAC    4           5
## 12       DAL    5           6
## 13       MIN    3           4
## 14       PHX    6          10
## 15       DEN    2           3
## 16       LAL    7          13
# Set number of simulations
nr_sims <- 1000

# Run the sims and get time elapsed
system.time(
  sim_results <- playoff_sim(nr_sims, xgb_last, playoff_team_seeding)
)
## [1] "All sims complete!"
##    user  system elapsed 
## 520.055   6.930 527.386
# Set sim outputs to variables
all_results <- sim_results$results
all_final_series <- sim_results$final_series 

# Show the number of simulations that resulted in each team being eliminated at a given stage
results_extended <- 
  all_final_series %>% 
  group_by(round_made, team_name) %>% 
  summarise(
    total = n(),
    .groups = 'drop'
  ) %>% 
  pivot_wider(
    names_from = round_made,
    values_from = c(total),
    values_fill = 0
  )

# Display table
results_extended
## # A tibble: 16 × 6
##    team_name   `1`   `2`   `3`   `4`   `5`
##    <chr>     <int> <int> <int> <int> <int>
##  1 BOS         753   509   333   198   198
##  2 CLE         514   195    90    32    32
##  3 DAL         610   336   178   103   103
##  4 DEN         409   177    71    36    36
##  5 IND         672   381   198    90    90
##  6 LAC         390   155    58    25    25
##  7 LAL         591   357   201   107   107
##  8 MIA         247   125    59    18    18
##  9 MIL         328   135    56    25    25
## 10 MIN         584   297   148    86    86
## 11 NOP         315   128    51    23    23
## 12 NYK         520   257    91    35    35
## 13 OKC         685   381   225   127   127
## 14 ORL         486   171    67    28    28
## 15 PHI         480   227   106    35    35
## 16 PHX         416   169    68    32    32

Visualise the path to NBA Finals

# Visualise the playoff bracket simulations ---------------------------

# Processing and summarising results
results_proportion <- results_extended %>%
  group_by(team_name) %>%
  reframe(across(c(`1`, `2`, `3`, `4`, `5`), 
                ~ .x / nr_sims,
                .names = "Round {col}")) %>%
  select(-`Round 5`) %>%
  rename(
     'Conference Finals' = `Round 3`,
     'Finals' = `Round 4`,
    ) %>%
  arrange(desc(Finals), desc(`Conference Finals`))

# Join the logo URLs with the main data frame
results_proportion <- results_proportion %>%
  left_join(logo_mapping, by = "team_name")

# Create probabilities table for advancing rounds
results_proportion %>%
  select(-team_name) %>%
  select(logo_url,`Round 1`:Finals) %>%
  gt() %>%
  tab_header(
    title = "2023 NBA Playoff Simulations",
    subtitle = "The % chance that a team wins that round*",
  ) %>%
  fmt_percent(
    columns = c("Round 1", "Round 2", "Conference Finals", "Finals"),
    decimals = 1
  ) %>%
  cols_label(
    "logo_url" = "Team",
    "Round 1" = "Rnd 1",
    "Round 2" = "Rnd 2",
    "Conference Finals" = "Conf. Finals",
    "Finals" = "Finals",
  ) %>%
  tab_options(table.width = pct(50)) %>%
  gt_img_rows(logo_url) %>%
  tab_source_note("*The proportion of simulated playoff brackets where a team wins or advances on from that round.") %>%
  gt_theme_538() %>%
  tab_style(
    style = cell_text(align = 'center'),
    locations = cells_column_labels(columns = everything())
  ) %>%
  tab_style(
    style = cell_text(align = 'center'),
    locations = cells_body(columns = everything())
  )
2023 NBA Playoff Simulations
The % chance that a team wins that round*
Team Rnd 1 Rnd 2 Conf. Finals Finals
75.3% 50.9% 33.3% 19.8%
68.5% 38.1% 22.5% 12.7%
59.1% 35.7% 20.1% 10.7%
61.0% 33.6% 17.8% 10.3%
67.2% 38.1% 19.8% 9.0%
58.4% 29.7% 14.8% 8.6%
40.9% 17.7% 7.1% 3.6%
48.0% 22.7% 10.6% 3.5%
52.0% 25.7% 9.1% 3.5%
51.4% 19.5% 9.0% 3.2%
41.6% 16.9% 6.8% 3.2%
48.6% 17.1% 6.7% 2.8%
39.0% 15.5% 5.8% 2.5%
32.8% 13.5% 5.6% 2.5%
31.5% 12.8% 5.1% 2.3%
24.7% 12.5% 5.9% 1.8%
*The proportion of simulated playoff brackets where a team wins or advances on from that round.

Series Predictions

# 2024 Playoff Series predictor ---------------------------

# Point estimate example
get_series_prediction_2024("Finals","DAL","BOS", "Point Estimate")
Series Win- Point Estimate
Simulated Finals series games between DAL and BOS
Team Win % Avg # of Games
52.8% 5.9
47.2% 5.8
The % chance that a team wins a series in a given, average number of games.
# 2024 Playoff Series predictor ---------------------------

# Probabilistic
get_series_prediction_2024("Finals","DAL","BOS", "Probabilistic")
Series Win- Probabilistic
Simulated Finals series games between DAL and BOS
Team Win % # of Games
100.0% 4
61.1% 5
38.9% 5
61.5% 6
38.5% 6
50.0% 7
50.0% 7
The % chance that a team wins a series when playing a given total number of games.

Part 3 – Finding Insights from Your Model

Classifying teams with a competitive window (last 2 seasons) as back luck or due to unknown causes.

# Find teams that made 23 and 24 season playoffs and that under performed in Simulated 2024 Playoffs
p3_playoff_teams <- team_data %>%
  filter((season >= 2022 | season <= 2023) & gametype == 4) %>%
  distinct(off_team) %>%
  rename("team_name" = "off_team")

glimpse(p3_playoff_teams)
## Rows: 30
## Columns: 1
## $ team_name <chr> "DAL", "MIA", "MEM", "OKC", "SAS", "LAL", "PHX", "CHI", "BOS…
# Prediction for Knicks vs Indiana in Round 2 2024 Playoffs
get_series_prediction_2024("Round 2","IND","NYK", "Probabilistic")
Series Win- Probabilistic
Simulated Round 2 series games between IND and NYK
Team Win % # of Games
61.0% 4
39.0% 4
59.5% 5
40.5% 5
56.8% 6
43.2% 6
50.0% 7
50.0% 7
The % chance that a team wins a series when playing a given total number of games.
# Prediction for Denver vs Lakers in Round 2 2024 Playoffs
get_series_prediction_2024("Round 1","DEN","LAL", "Point Estimate")
Series Win- Point Estimate
Simulated Round 1 series games between DEN and LAL
Team Win % Avg # of Games
40.9% 5.9
59.1% 5.7
The % chance that a team wins a series in a given, average number of games.

By assessing the simulation output for the 2024 NBA Playoffs it appears that despite making the playoffs two consecutive seasons in a row both the New York Knicks and Lakers under performed against my models expectations.

The Knicks had more than a 50% chance of beating the Indiana Pacers to reach the Eastern Conference Finals when the series went to 5 and 7 games but ended up losing in 7 games to the Pacers. This particular example is a case of injuries affecting the outcome of the series. Despite the model predicting the Knicks had a better chance of winning the more games played in the series, 4 starting rotation players combined for only 3 games played due to injury. Without their expected, strongest playoff lineup the Knicks fell surprisingly at home in Game 7 to the Pacers. If given more data, specific metrics targeting starters minutes lost due to injuries not just missed games as well as other player health metrics, this might produce better predictions. given the injury history of a team before and during a playoff series.

The Los Angeles Lakers had greater than 50% chance of beating the Denver Nuggets to reach Round 2 of the playoffs in the West but lost in 5 games. Despite the Denver Nuggets being the previous seasons champion, the model still favoured the Lakers due to the lack in decay in the ratings for team strength since their Finals win in the year of the COVID-19 NBA bubble. The Denver Nuggets were clear favourites in this series and the model does not fully account for this. To fix this limitation in the model, an adjustment to the input parameters to the Glicko-2 rating system model that would reduce the numbers of (lost) games before rating decay begins would would be sufficient.